On the Study of Diurnal Urban Routines on Twitter

Authors

  • Mor Naaman
  • Amy Xian Zhang
  • Samuel Brody
  • Gilad Lotan
Abstract

Social media activity in different geographic regions can expose a varied set of temporal patterns. We study and characterize diurnal patterns in social media data for different urban areas, with the goal of providing context and framing for reasoning about such patterns at different scales. Using one of the largest datasets to date of Twitter content associated with different locations, we examine within-day variability and across-day variability of diurnal keyword patterns for different locations. We show that only a few cities currently provide the magnitude of content needed to support such across-day variability analysis for more than a few keywords. Nevertheless, within-day diurnal variability can help in comparing activities and finding similarities between cities.

Introduction

Social media activity in different geographic regions exposes a varied set of temporal patterns. In particular, Social Awareness Streams (SAS) (Naaman, Boase, and Lai 2010), available from social media services such as Facebook, Twitter, FourSquare, Flickr, and others, allow users to post streams of lightweight content artifacts, from short status messages to links, pictures, and videos, in a highly connected social environment. The vast amounts of SAS data reflect, in new ways, people's attitudes, attention, and interests, offering unique opportunities to understand and draw insights about social trends and habits.

In this paper, we focus on characterizing social media patterns in different urban areas (US cities), with the goal of providing a framework for reasoning about activities and diurnal patterns in different cities. Using Twitter as a typical SAS, previous research studied specific temporal patterns that are similar across geographies, in particular with respect to the expression of mood (Golder and Macy 2011; Dodds et al. 2011).
We aim to provide insights for reasoning about diurnal patterns in different geographic (urban) areas that can be used in studying activity patterns in these areas, going beyond previous work that had mostly examined topical differences between posts in different geographic areas (Eisenstein et al. 2010; Hecht et al. 2011) or briefly examined broad diurnal differences (Cheng et al. 2011) in volume between cities. Such study can contribute to urban studies, with implications for diverse social challenges such as public health, emergency response, community safety, transportation, and resource planning as well as Internet advertising, providing insights and information that cannot readily be extracted from other sources.

* Amy and Sam were at Rutgers at the time of this work. Copyright © 2012, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Developing such a framework presents a number of challenges, both technical and practical. First, SAS data (and in particular Twitter) has been shown to be quite noisy. Users of SAS post different types of content, from information and link sharing, to personal updates, to social interactions, and many others (Naaman, Boase, and Lai 2010). Can stable patterns be reliably extracted given this noisy environment? Second, reliably extracting the location associated with Twitter content is still an open problem, as we discuss below. Finally, Twitter content volume shifts over time as more users join the service, and fluctuates widely in response to breaking events and other happenings, from Valentine's Day to the news about Bin Laden's capture and demise. Such temporal volume fluctuations might distort otherwise stable patterns and make them difficult to extract.

In this paper, therefore, we report on a study that extracts and reasons about stable temporal patterns from Twitter data.
In particular, we: 1) use large-scale data with manual coding to get a wide sample of tweets for different cities; 2) study within-day and across-day variability of patterns in cities; and 3) reason about differences between cities with respect to overall patterns as well as individual ones.

Related Work

Broadly speaking, this work is informed by two key areas of related work: the use of new technologies and data sources for urban studies, and studies of social media to extract "real world" insights, or temporal dynamics. Here we broadly address these areas, before discussing other recent research that directly informed our work.

The related research area sometimes dubbed "urban sensing" (Cuff, Hansen, and Kang 2008) analyzes various new datasets to understand the dynamics and patterns of urban activity. Most prominently, mobile phone data, mainly proprietary data from wireless carriers (e.g., calls made and positioning data), help expose travel patterns and broad spatio-temporal dynamics, e.g., in (Gonzalez, Hidalgo, and Barabasi 2008). Social media has also been used to augment our understanding of urban spaces. Researchers have used geotagged photographs, for example, to gain insight about tourist activities in cities (Ahern et al. 2007; Crandall et al. 2009; Girardin et al. 2008). Twitter data can augment and improve on these research efforts, and allow for new insights about communities and urban environments.

More recently, as mentioned above, researchers have examined differences in social media content between geographies using keyword and topic models only (Eisenstein et al. 2010; Hecht et al. 2011). Cheng et al. (2011) examine patterns of "checkins" on Foursquare and briefly report on differences between cities regarding their global diurnal patterns.

Beyond geographies and urban spaces, several research efforts have examined social media temporal patterns and dynamics.
Researchers have examined daily and weekly temporal patterns on Facebook (Golder, Wilkinson, and Huberman 2007), and to some degree Twitter (Java et al. 2007), but did not address the stability of patterns, or the differences between geographic regions. Recently, Golder and Macy (2011) have examined temporal variation related to Twitter posts reflecting mood, across different locations, and showed that diurnal (as well as seasonal) mood patterns are robust and consistent in many cultures. The activity and volume measures we use here are similar to Golder and Macy's, but we study patterns more broadly (in terms of keywords) and in a more focused geography (city-scale instead of timezone and country).

Identifying and reasoning about repeating patterns in time series has, of course, long been a topic of study in many domains. Most closely related to our domain, Shimshoni, Efron, and Matias (2009) examined the predictability of search query patterns using day-level data. In their work, the authors model seasonal and overall trend components to predict search query traffic for different keywords. The "predictability" criterion, though, is arbitrary, and only used to compare between different categories of use.

Definitions

We begin with a simple definition of the Twitter content as used in this work. Users are marked $u \in U$, where $u$ can be minimally modeled by a user ID. However, the Twitter system features additional information about the user, most notably their hometown location $\ell_u$ and a short "profile" description. Content items (i.e., tweets) are marked $m \in M$, where $m$ can be minimally modeled by a tuple $(u_m, c_m, t_m, \ell_m)$ containing the identity of the user posting the message $u_m \in U$, the content of the message $c_m$, the posting time $t_m$ and, increasingly, the location of the user at the time of posting $\ell_m$. This simple model captures the essence of the activity in many different SAS platforms, although we focus on Twitter in this work.
Using these building blocks, we now formalize some aggregate concepts that will be used in this paper. In particular, we are interested in content for a given geographic area, and in examining the content based on diurnal (hourly) patterns. We use the following formulations:

  • $M_{G,w}(d, h)$ are the content items associated with a word $w$ posted in geographic area $G$ during hour $h$ of day $d$.
  • $X_{G,w}(d, h) = |M_{G,w}(d, h)|$ defines a time series of the volume of messages associated with keyword $w$ in location $G$.

In other words, $X_{G,w}(d, h)$ is the volume of messages (tweets) in region $G$ that include the word $w$ and were posted during hour $h = 0, \ldots, 23$ of day $d = 1, \ldots, N$, where $N$ is the number of days in our dataset. We describe below how these variables are computed.

Data Collection

In this section, we describe the data collection and aggregation methodology. We first show how we collected a set of tweets $M_G$ for every location $G$ in our dataset. We then describe how we select a set of keywords $w$ for the analysis, and compute the $X_{G,w}(d, h)$ time series data for each keyword and location.

The data was extracted from Twitter using the Twitter Firehose access level, which, according to Twitter, "Returns all public statuses." Our initial dataset included all public tweets posted from May 2010 through May 2011.

Tweets from Geographic Regions

To reason about patterns in various geographic regions, we need a robust dataset of tweets from each region $G$ to create the sets $M_{G,w}$. Twitter offers two possible sources of location data. First, a subset of tweets have associated geographic coordinates (i.e., they are geocoded, containing an $\ell_m$ field as described above). Second, tweets may have the location available from the profile description of the Twitter user that posted the tweet ($\ell_u$, as described above).
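As an illustration of the definitions of $M_{G,w}(d, h)$ and $X_{G,w}(d, h)$ given above, the following is a minimal sketch of how such a time series might be tallied. It is not the authors' implementation: the function name `keyword_hourly_volume`, the `(text, posted_at)` input shape, and the whitespace tokenization are all assumptions made for illustration, and the input is assumed to be already restricted to a single region $G$ with timestamps in local time.

```python
from collections import Counter
from datetime import date, datetime


def keyword_hourly_volume(tweets, keyword, start_date):
    """Tally X_{G,w}(d, h) for one region G and one keyword w.

    tweets: iterable of (text, posted_at) pairs, assumed to be already
        limited to region G (hypothetical input shape).
    keyword: the word w, matched against lowercased whitespace tokens
        (an assumed matching rule).
    start_date: first day of the dataset, which becomes d = 1.
    Returns a Counter mapping (d, h) to the number of matching tweets.
    """
    counts = Counter()
    for text, posted_at in tweets:
        if keyword in text.lower().split():
            d = (posted_at.date() - start_date).days + 1  # days are 1-based
            counts[(d, posted_at.hour)] += 1              # hours are 0..23
    return counts


# Toy data, invented for illustration only.
sample = [
    ("Lunch time now", datetime(2010, 5, 1, 12, 5)),
    ("lunch again", datetime(2010, 5, 1, 12, 40)),
    ("no food yet", datetime(2010, 5, 2, 9, 0)),
    ("late lunch", datetime(2010, 5, 2, 15, 0)),
]
x = keyword_hourly_volume(sample, "lunch", date(2010, 5, 1))
```

A `Counter` returns 0 for any (day, hour) pair with no matching tweets, which matches the idea of a volume series defined for every hour of every day.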
The $\ell_m$ location often represents the location of the user when posting the message (e.g., attached to tweets posted from a GPS-enabled phone), but was not used in our study since only a small and biased portion of the Firehose dataset (about 0.6%) includes $\ell_m$ geographic coordinates. The rest of the paper uses location information derived from the user profile, as described next. Note that the profile location is not likely to be updated as users move around: a "Honolulu" user will appear to be tweeting from their hometown even when they are (perhaps temporarily) in Boston. Overall, though, the profile location will reflect tendencies, albeit with a moderate amount of noise.

We used location data associated with the profile of the user posting the tweet, $\ell_u$, to create our datasets for each location $G$. A significant challenge in using this data is that the location field on Twitter is a free-text field, resulting in data that may not even describe a geographic location, or may describe one in an obscure, ambiguous, or unspecific manner (Hecht et al. 2011). However, according to Hecht et al., 1) about 19% of users had an automatically updated profile location (updated by mobile Twitter apps such as Ubertweet for Blackberry), and 2) 66% of the remaining users had at least some valid location information in their profile; about 70% of those had information that exceeded the city-level data required for this study. In total, the study suggests that 57% of users would have some profile location information appropriate for this study. As Hecht et al. (2011) report, this user-provided data field may still be hard to automatically and robustly associate with a real-world location. We will overcome or avoid some of these issues using our method specified below (see footnote 1).

To create our dataset, we had to resolve the free-text user profile location $\ell_u$ associated with a tweet to one of the cities (geographic areas $G$) in our study.
Our solution had the following desired outcomes, in order of importance:

  • high precision: as little content as possible from outside that location.
  • high recall: as much content as possible without compromising precision.

In order to match the user profile location to a specific city in our dataset, we used a dictionary-based approach. In this approach:

1. We used a different Twitter dataset to generate a comprehensive dictionary of profile location strings that match each location in our dataset. For example, strings such as "New York, NY" or "NYC" can be included in the New York City dictionary.
2. We match the user profile associated with the tweets in our dataset against the dictionary. If a profile location $\ell_u$ matches one of the dictionary strings for a city, the tweet is associated with that city in our data.

We show next how we created a robust and extensive location dictionary for each city.

Generating Location Dictionaries

The goal of the dictionary generation process was to establish a list of strings for each city that would reliably (i.e., accurately and comprehensively) represent that location, resulting in the most content possible, with as little noise as possible. To this end, we obtained an initial dataset of tweets that are likely to have been generated in one of our locations of interest.

We used a previously collected dataset of Twitter messages for the cities in our study. This dataset included, for every city, represented via a point-radius geographic region $G$, tweets associated with that region. The data was collected from September 2009 until February 2011 using a point-radius query to the Twitter Search API. For such queries, the Twitter Search API returns a set of geotagged tweets, as well as tweets geocoded by Twitter to that area (using a proprietary combination of user profile, IP, and other signals), that match the point-radius query.
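The two-step dictionary approach listed above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' code: the function names (`normalize`, `build_index`, `assign_city`) are hypothetical, and the case folding and whitespace normalization are added assumptions that the text does not describe.

```python
def normalize(location):
    """Lowercase and collapse whitespace (an assumed normalization step)."""
    return " ".join(location.lower().split())


def build_index(city_dictionaries):
    """Invert {city: [location strings]} into {normalized string: city}.

    Assumes each string belongs to at most one city, i.e. ambiguous
    strings were already removed when the dictionaries were built.
    """
    index = {}
    for city, strings in city_dictionaries.items():
        for s in strings:
            index[normalize(s)] = city
    return index


def assign_city(profile_location, index):
    """Return the matching city, or None if the profile string is unknown."""
    return index.get(normalize(profile_location))


# Toy dictionaries; the New York strings are the examples named in the text.
index = build_index({
    "New York City": ["New York, NY", "NYC"],
    "Boston": ["Boston, MA"],
})
matched = assign_city("nyc", index)
unmatched = assign_city("Uptown", index)
```

Returning `None` for unknown strings favors precision over recall, in line with the ordering of the desired outcomes: a tweet is dropped rather than guessed into a city.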
For every area $G$ representing a city, we thus obtained a set $L_G$ of tweets posted in $G$, or geocoded by Twitter to $G$. Notice that other sources of approximate location data (e.g., a strict dataset of geocoded Twitter content from a location) can alternatively be used in the process described below.

From the dataset $L_G$, we extracted the profile location $\ell_u$ for the user posting each tweet. For each city, we sorted the various distinct $\ell_u$ strings by decreasing popularity (the number of unique tweets). For example, the top $\ell_u$ strings for the point-radius area representing New York City included the strings New York (14,482,339 tweets), New York, NY (6,955,481), NYC (6,681,874) and so forth, including also subregions of New York City such as Harlem (587,430).

[Figure 1: CDFs of location field values for the most popular 2000 location strings for four cities. Panels: Honolulu, HI; New York, NY; San Francisco, CA; Richmond, VA.]

Footnote 1: In an alternative approach, a user's home location can be estimated to some degree from the topics they post about on their account (Cheng, Caverlee, and Lee 2010; Eisenstein et al. 2010; Hecht et al. 2011). For example, Cheng et al. (2010) claim 51% within-100-miles accuracy for users in their study.

We proceeded to clean the top location string lists for each city of noisy, ambiguous, and erroneous entries. Because the distribution of $\ell_u$ values for each location was heavy-tailed, we chose to look only at the top location strings that comprised 90% of unique tweets from each city.
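The 90% cutoff described above amounts to walking down the popularity-sorted list until the cumulative share of tweets reaches the target. A minimal sketch, assuming a hypothetical `{location string: tweet count}` mapping per city (the function name and the toy counts are invented for illustration and are not taken from the dataset):

```python
def top_strings_for_coverage(string_counts, coverage=0.9):
    """Return the most popular strings that together reach `coverage`.

    string_counts: {location string: number of tweets} for one city.
    Strings are taken in decreasing order of count until the running
    total first reaches coverage * (total number of tweets).
    """
    total = sum(string_counts.values())
    selected, running = [], 0
    ranked = sorted(string_counts.items(), key=lambda kv: kv[1], reverse=True)
    for string, count in ranked:
        if running >= coverage * total:
            break  # target share already reached; the tail is skipped
        selected.append(string)
        running += count
    return selected


# Toy counts, invented for illustration only.
toy_counts = {"New York": 60, "NYC": 25, "Harlem": 10, "Uptown": 3, "Gotham": 2}
kept = top_strings_for_coverage(toy_counts, coverage=0.9)
```

With a heavy-tailed distribution, a cutoff like this keeps the manual review list short: only the selected head of the list needs to be inspected by hand.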
Figure 1 shows the cumulative distribution function (CDF) of the location string frequencies for four locations in our dataset. The x-axis represents the location strings, ordered by popularity. The y-axis is the cumulative portion of tweets that the top strings account for. For example, the 500 most popular location terms in New York account for about 82% of all tweets collected for New York in the $L_G$ dataset.

Then, for each city, we manually filtered the list, marking entries that are ambiguous (e.g., "Uptown" or "Chinatown", which Twitter coded as New York City) or incorrect due to geocoder issues (e.g., "Pearl of the Orient", which the Twitter geocoder associates with Boston, for one reason or another); see Hecht and Chi's (2011) discussion for more on this topic. This annotation process, performed by one of the authors, involved lookups in external sources where needed to determine which terms were slang phrases or nicknames for a city as opposed to errors.

For this study, we selected a set of US state capitals, adding a number of large US metropolitan areas that are not state capitals (such as New York and Los Angeles), as well as London, England. This selection allowed us to study cities of various scales and patterns of Twitter activity.

As mentioned above, using these lists of $\ell_u$ location strings for the cities, we filtered the Twitter Firehose dataset to extract tweets from users whose location fields exactly matched a string on one of our cities' lists. Using this process, we generated datasets of tweets from each city, summarized in Figure 2. The figure shows, for each city, the number of tweets in our 380-day dataset (in millions). For example, we have over 150 million tweets from Los Angeles, an average of about 400,000 a day. Notice the sharp drop in the number of tweets from the large metropolitan areas to local centers (Lansing, Michigan with 2.5 million tweets, an average of 6,500 per day). We only kept locations with over 2.5 million tweets, i.e., the 29 locations shown in Figure 2.

[Figure 2: Number of tweets per city in the 380-day dataset, in millions (axis from 0 to 300). Cities shown: New York, NY; London, England; Los Angeles, CA; Atlanta, GA; San Francisco, CA; Boston, MA; Washington DC; Olympia, WA; Annapolis, MD; Phoenix, AZ; Columbus, OH; Austin, TX; Indianapolis, IN; Denver, CO; Richmond, VA; Saint Paul, MN; Nashville, TN; Sacramento, CA; Tallahassee, FL; Oklahoma City, OK; Frankfort, KY; Baton Rouge, LA; Raleigh, NC; Salt Lake City, UT; Little Rock, AR; Honolulu, HI; Charleston, WV; Lansing, MI; Columbia, SC.]
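The per-day averages and the volume threshold mentioned above are simple arithmetic over the 380-day dataset. The sketch below uses only the rounded figures quoted in the text (150 million and 2.5 million tweets); the function names are hypothetical, and the inclusive comparison is an assumption made because the quoted totals are rounded.

```python
DATASET_DAYS = 380        # length of the dataset, as stated in the text
MIN_TWEETS = 2_500_000    # volume threshold, as stated in the text


def daily_average(total_tweets, days=DATASET_DAYS):
    """Average number of tweets per day over the dataset."""
    return total_tweets / days


def cities_to_keep(tweet_totals, min_tweets=MIN_TWEETS):
    """Keep cities whose total volume reaches the threshold.

    Uses >= (an assumption) so that a city quoted at a rounded
    2.5 million tweets is retained, as Lansing is in the text.
    """
    return {city: n for city, n in tweet_totals.items() if n >= min_tweets}


# Rounded totals quoted in the text; "Small Town" is an invented entry.
totals = {"Los Angeles, CA": 150_000_000, "Lansing, MI": 2_500_000, "Small Town": 900_000}
kept_cities = cities_to_keep(totals)
la_per_day = round(daily_average(totals["Los Angeles, CA"]))
lansing_per_day = round(daily_average(totals["Lansing, MI"]))
```

The results are consistent with the rounded figures in the text: roughly 395,000 tweets per day for Los Angeles ("about 400,000") and roughly 6,600 per day for Lansing ("6,500").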




Publication date: 2012